Pdfplumber: Integration #12949

Merged
merged 3 commits into from
Jan 27, 2025

ennamarie19
Contributor

This pull request integrates the Dockerfile needed to build the fuzzers for pdfplumber.

Note: The fuzzers were NOT merged upstream, following discussion with the project maintainer here and the precedent for out-of-repo fuzzers established here.

ennamarie19 changed the title from "Integration" to "Pdfplumber: Integration" on Jan 18, 2025

ennamarie19 has previously contributed to projects/pdfplumber. The previous PR was #12567

Contributor

DonggeLiu left a comment

Thanks, @ennamarie19

DonggeLiu merged commit ad1d940 into google:master on Jan 27, 2025
15 checks passed

jsvine commented Jan 30, 2025

Thank you, @ennamarie19. I'm the maintainer of pdfplumber. I've started receiving fuzzing results via email. Some look helpful, while others appear to be triggered by problems in a core dependency, pdfminer.six. For example, this one: https://oss-fuzz.com/testcase-detail/5914823472250880

Is there a way to set up the fuzzer to ignore errors that originate with that dependency? That would help me focus on issues I can directly fix in pdfplumber.

Contributor Author

ennamarie19 commented Jan 30, 2025 via email

jsvine added a commit to jsvine/pdfplumber that referenced this pull request Feb 9, 2025

jsvine commented Feb 9, 2025

Thanks, @ennamarie19. I've pushed a commit that handles exceptions stemming from pdfminer.six and malformed PDFs: jsvine/pdfplumber@43ccc5b

Would it make sense to have the fuzzer then ignore these particular exceptions?:

from pdfplumber.utils.exceptions import MalformedPDFException, PdfminerException

@ennamarie19
Contributor Author

@jsvine Good idea! I've updated the scripts to catch and ignore those exceptions. However, I was fuzzing the main branch, so I will open a PR here to fuzz the develop branch instead.
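
For readers following along, here is a minimal sketch of what "catch and ignore those exceptions" could look like in a fuzz harness. It is not the actual OSS-Fuzz script from this PR. The two exception classes are local stand-ins so the snippet runs without pdfplumber installed, and `fuzz_one_input` and its `parse` parameter are hypothetical names chosen for illustration.

```python
import io
from typing import Callable, Tuple, Type


# Stand-ins so this sketch runs without pdfplumber installed. A real harness
# would instead use the import suggested above:
#   from pdfplumber.utils.exceptions import MalformedPDFException, PdfminerException
class MalformedPDFException(Exception):
    pass


class PdfminerException(Exception):
    pass


# Exceptions treated as "expected" for malformed input, so not reported.
IGNORED: Tuple[Type[BaseException], ...] = (MalformedPDFException, PdfminerException)


def fuzz_one_input(data: bytes, parse: Callable[[io.BytesIO], object]) -> bool:
    """Run one fuzz input through `parse` (hypothetical callable).

    Returns True if parsing finished, False if an ignored exception was raised.
    Any other exception propagates, so the fuzzer still records it as a finding.
    """
    try:
        # Assumption: in a real harness, `parse` would open the stream with
        # pdfplumber and exercise the pages.
        parse(io.BytesIO(data))
    except IGNORED:
        return False
    return True
```

The design choice is to swallow only the two named exception types: anything else, such as a `ValueError` or `RecursionError` raised inside pdfplumber itself, still surfaces as a crash report.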

jsvine commented Feb 13, 2025

Great, thanks!

DenseConsulting added a commit to DenseConsulting/pdfplumber that referenced this pull request Jun 24, 2025
commit 0dd4925
Author: Jeremy Singer-Vine <[email protected]>
Date:   Thu Jun 12 07:31:46 2025 -0400

    Update CITATION.cff

commit c6a24be
Author: Jeremy Singer-Vine <[email protected]>
Date:   Thu Jun 12 07:23:30 2025 -0400

    Bump version to 0.11.7

commit 51f3065
Author: Jeremy Singer-Vine <[email protected]>
Date:   Thu Jun 12 07:21:29 2025 -0400

    Update CHANGELOG.md

commit 738f6f0
Author: Jeremy Singer-Vine <[email protected]>
Date:   Wed Jun 11 23:40:50 2025 -0400

    Add test for CLI auto-help

commit b88907f
Author: mara004 <[email protected]>
Date:   Fri May 2 23:07:05 2025 +0200

    Minor cleanup around pypdfium2 integration

commit 7e364e6
Author: Jeremy Singer-Vine <[email protected]>
Date:   Wed Jun 11 22:24:28 2025 -0400

    Add Page.trimbox, .bleedbox, .artbox (jsvine#1313)

    Thanks to @samuelbradshaw for the suggestion!

commit 4c7e092
Author: Jeremy Singer-Vine <[email protected]>
Date:   Fri May 16 08:20:30 2025 -0400

    Upgrade pdfminer.six from 20250327 to 20250506

    ... and adjust color handling accordingly.

commit 3e0d4df
Author: Jeremy Singer-Vine <[email protected]>
Date:   Wed Jun 11 23:26:09 2025 -0400

    Run make format

commit cd6fd70
Author: nobody <[email protected]>
Date:   Mon May 19 08:31:53 2025 -0400

    Auto-add --help if CLI run w/o args

    (Commit message edited by @jsvine.)

commit 02ff431
Author: Jeremy Singer-Vine <[email protected]>
Date:   Thu Mar 27 23:21:17 2025 -0400

    Tiny tweaks to CHANGELOG.md

commit 8cd8e48
Author: Jeremy Singer-Vine <[email protected]>
Date:   Thu Mar 27 23:15:41 2025 -0400

    Bump version to 0.11.6

commit 44b078c
Author: Jeremy Singer-Vine <[email protected]>
Date:   Thu Mar 27 23:15:06 2025 -0400

    Update CHANGELOG.md

commit e15ed98
Author: Jeremy Singer-Vine <[email protected]>
Date:   Thu Mar 27 22:44:25 2025 -0400

    Fix bug w/ use_text_flow=True extractions (jsvine#1279)

    ... related to flows where text bounces between lines.

    h/t @samuelbradshaw

commit f2ad942
Author: Jeremy Singer-Vine <[email protected]>
Date:   Thu Mar 27 22:00:14 2025 -0400

    Add another oss-fuzz test case, already fixed

commit 748ff31
Author: Jeremy Singer-Vine <[email protected]>
Date:   Thu Mar 27 21:58:17 2025 -0400

    More broadly handle RecursionError, via oss-fuzz

commit 9148810
Author: Jeremy Singer-Vine <[email protected]>
Date:   Thu Mar 27 21:57:21 2025 -0400

    Fix unhandled None in do_PDFStream, via oss-fuzz

commit 3fcb493
Author: Jeremy Singer-Vine <[email protected]>
Date:   Thu Mar 27 21:31:06 2025 -0400

    Bump pdfminer.six to version 20250327

commit 7e28e76
Author: Jeremy Singer-Vine <[email protected]>
Date:   Tue Mar 25 23:03:13 2025 -0400

    Remove test_issue_1089 (jsvine#1263)

    @booxter makes a good point that the test is platform-specific. The
    issue has been resolved, and it's not expected to return, so I think
    provisionally OK to remove this test.

commit 630f30e
Author: Jeremy Singer-Vine <[email protected]>
Date:   Tue Mar 25 22:52:47 2025 -0400

    pragma:nocover exceptions no longer raised by pdfminer.six

commit 12a73a2
Author: Jeremy Singer-Vine <[email protected]>
Date:   Tue Mar 25 22:52:16 2025 -0400

    Bump pdfminer.six to version 20250324

commit 6349adb
Author: Jeremy Singer-Vine <[email protected]>
Date:   Mon Feb 10 22:09:28 2025 -0500

    Add escapechar for .to_csv(...)

commit 980494a
Author: Jeremy Singer-Vine <[email protected]>
Date:   Mon Feb 10 21:54:10 2025 -0500

    Use csv.QUOTE_MINIMAL for .to_csv(...)

commit 47a7ab8
Author: Jeremy Singer-Vine <[email protected]>
Date:   Mon Feb 10 21:53:17 2025 -0500

    Update exception handler

commit 8f5f498
Author: Jeremy Singer-Vine <[email protected]>
Date:   Sun Feb 9 17:23:37 2025 -0500

    Fix wrong exception expectation in test

commit 43ccc5b
Author: Jeremy Singer-Vine <[email protected]>
Date:   Sun Feb 9 16:23:57 2025 -0500

    Catch exceptions from pdfminer and malformed PDFs

    ... thanks to OSS-Fuzz and @ennamarie19

    Cf.: google/oss-fuzz#12949

commit a77808a
Merge: c562774 5d47d5a
Author: Jeremy Singer-Vine <[email protected]>
Date:   Sun Feb 2 11:16:58 2025 -0500

    Merge pull request jsvine#1270 from mara004/patch-1

    test_issue_1089: update wording regarding pypdfium2

commit 5d47d5a
Author: mara004 <[email protected]>
Date:   Sun Feb 2 16:27:53 2025 +0100

    test_issue_1089: update wording regarding pypdfium2

    See jsvine#1089 (comment) for background

commit c562774
Author: Jeremy Singer-Vine <[email protected]>
Date:   Wed Jan 1 10:21:18 2025 -0500

    Bump version to 0.11.5

commit 4af0e1d
Author: Jeremy Singer-Vine <[email protected]>
Date:   Wed Jan 1 10:21:00 2025 -0500

    Update CHANGELOG.md

commit 7c63541
Author: Jeremy Singer-Vine <[email protected]>
Date:   Wed Jan 1 10:26:04 2025 -0500

    Add thanks to @stolarczyk in README.md

commit 078df97
Author: Jeremy Singer-Vine <[email protected]>
Date:   Tue Dec 31 09:11:32 2024 -0500

    Fix jsvine#1237 (tf → table_settings) h/t @n-traore

    And thanks to @cmdlineluser for the nudge.

commit 6e54799
Author: Jeremy Singer-Vine <[email protected]>
Date:   Sat Dec 28 12:13:32 2024 -0500

    Add thanks to @brandonrobertz (jsvine#1235)

commit 69d010a
Author: Jeremy Singer-Vine <[email protected]>
Date:   Sun Dec 15 23:24:31 2024 -0500

    Add initial test/docs for `format --text` (jsvine#1235)

commit e0ee254
Merge: 28d4f50 f3f2b57
Author: Jeremy Singer-Vine <[email protected]>
Date:   Sun Dec 15 23:07:14 2024 -0500

    Merge pull request jsvine#1235 from brandonrobertz/add-text-output-mode

    Add a --format text option

commit f3f2b57
Author: Brandon Roberts <[email protected]>
Date:   Tue Dec 10 14:21:22 2024 -0800

    Add a --format text option

    I use this regularly because pdfplumber has among the best layout
    preserving methods for PDFs, especially machine generated ones.
    Exposing the page output via CLI lets me use pdfplumber as a general
    purpose PDF-to-text tool.

    Usage:

    pdfplumber --format text file.pdf > file.txt

commit 28d4f50
Merge: ea3b3e5 2073164
Author: Jeremy Singer-Vine <[email protected]>
Date:   Sun Dec 8 23:10:15 2024 -0500

    Merge PR jsvine#1195

commit 2073164
Author: Jeremy Singer-Vine <[email protected]>
Date:   Sun Dec 8 22:55:30 2024 -0500

    Appease linter

commit c80c78d
Author: Michal Stolarczyk <[email protected]>
Date:   Fri Nov 22 16:48:19 2024 +0100

    add a test to cover raise_unicode_errors parameter

commit 1e4b48a
Author: Jeremy Singer-Vine <[email protected]>
Date:   Fri Nov 22 08:18:11 2024 -0500

    Run 'make format' and ignore code line-length

commit 138abab
Author: Michal Stolarczyk <[email protected]>
Date:   Wed Nov 13 18:34:35 2024 +0100

    rename warn_unicode_error to raise_unicode_errors for clarity

    additionally change the default accordingly

commit ea3b3e5
Merge: 6ef62c9 8542adb
Author: Jeremy Singer-Vine <[email protected]>
Date:   Sun Nov 10 22:47:33 2024 -0500

    Merge pull request jsvine#1221 from erghelium/develop

    Fix broken link to Anssi Nurminen's master's thesis in the README.md

commit 8542adb
Author: Guilherme <[email protected]>
Date:   Sun Nov 10 18:19:04 2024 -0300

    Fix broken link to Anssi Nurminen's master's thesis in README

commit 6ef62c9
Author: Jeremy Singer-Vine <[email protected]>
Date:   Wed Oct 2 21:11:38 2024 -0400

    Add `name` property to `image` objects (jsvine#1201)

    h/t @djr2015

commit 396c5e3
Author: Michal Stolarczyk <[email protected]>
Date:   Fri Aug 30 10:24:39 2024 +0200

    warn on unicode decoding errors in PDF annotations

    In some cases the annotations may contain junk that prevents annotation processing altogether. This change allows the error to be ignored with a warning instead, configurable via the warn_unicode_error argument in the PDF initializer and/or the open() method.